A1. We replace the bias by a weight from a unit that always has state 1. Yes: we can also do this with other types of neural networks. (See the code sketch after the answers.)
A2. It's a unit that detects some pattern in the input, and there are some input transformations (e.g. a small translation or rotation) that have no effect on the value of the unit.
A3. RNNs and CNNs both have weight sharing, and this has consequences for how we compute gradients: we compute the gradient for each copy of a shared weight as if the copies were untied, then add those gradients together. Most neural networks don't have this.
A4. An autoregressive model is a time-series model that bases its prediction on a fixed number of past inputs. It's called "memoryless" because it has no hidden state that carries information forward in time: its prediction depends only on that fixed window of recent inputs.
A5. Pooling takes a few (usually 2x2) neighboring instantiations of a replicated feature detector and summarizes their activities by their sum, mean, or max, which becomes the input to the next layer. (See the code sketch after the answers.)
A6. Too large: we get oscillating or diverging weights. Too small: we learn too slowly (it takes too much time). (See the code sketch after the answers.)

B1a. We'd get zero gradients, which prevents learning.
B1b. When z<0 we still get zero gradients, but when z>0 we get useful non-zero gradients.
B2a. These data points are not linearly separable. (to do: draw)
B2b. Bias: -4, w1: 2, w2: 3. (bias = -1, w1 = 1, w2 = 1 is wrong.)
B3a. yi = exp(zi) / (exp(z1) + exp(z2))
B3b. Start from the softmax with z2 = 0:
  y1 = exp(z1) / (exp(z1) + exp(0))
  y1 = exp(z1) / (exp(z1) + 1)
Now divide the numerator and the denominator by exp(z1):
  y1 = (exp(z1)/exp(z1)) / ((exp(z1) + 1) / exp(z1))
  y1 = 1 / ((exp(z1) + 1) / exp(z1))
  y1 = 1 / (exp(z1)/exp(z1) + 1/exp(z1))
  y1 = 1 / (1 + exp(-z1))
QED. (See the numerical check after the answers.)
B4a. The input to the hidden unit is 0. The state of the hidden unit is logistic(0) = 1/2. The input (and state) of the output unit is 1.5. With a target of 0.5 this gives E = 0.5. (See the code sketch after the answers.)
B4b. Long story, but the answer is 3/4.
B5a. No. It's not what we're improving directly.
B5b. Yes (because we use a small enough learning rate).
B5c. No. We might instead get trapped in a local minimum.
B5d. No. It's not what we're optimizing (though it's strongly related to what we're optimizing).
B6. Sorry, I can't draw here.
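
Code sketch for A1 (illustrative, not part of the original answers; the numbers are made up): folding the bias into the weight vector by appending an input unit whose state is always 1.

import numpy as np

x = np.array([0.2, -1.0])          # ordinary inputs
w = np.array([0.5, 1.5])           # ordinary weights
b = -0.3                           # bias
z_with_bias = w @ x + b

x_aug = np.append(x, 1.0)          # extra unit that always has state 1
w_aug = np.append(w, b)            # the bias becomes that unit's weight
z_folded = w_aug @ x_aug

print(z_with_bias, z_folded)       # the two totals are identical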
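
Code sketch for A5 (illustrative; the feature map values are made up): summarizing non-overlapping 2x2 blocks of a feature map by their max or mean.

import numpy as np

def pool2x2(feature_map, mode="max"):
    # Group the map into non-overlapping 2x2 blocks and summarize each block.
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fm = np.array([[1., 2., 0., 1.],
               [3., 0., 2., 2.],
               [0., 1., 5., 0.],
               [1., 1., 0., 0.]])
print(pool2x2(fm, "max"))    # [[3. 2.] [1. 5.]]
print(pool2x2(fm, "mean"))   # [[1.5  1.25] [0.75 1.25]]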
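
Code sketch for A6 (illustrative): plain gradient descent on the one-dimensional error surface E(w) = w^2, whose gradient is 2w. A learning rate above 1.0 makes the weight oscillate and diverge; a tiny one still converges, but far too slowly.

def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w          # gradient of w**2 is 2*w
    return w

print(descend(lr=1.1))    # about 38 in magnitude: oscillating and diverging
print(descend(lr=0.001))  # about 0.96: barely moved, learning far too slowly
print(descend(lr=0.3))    # about 1e-8: a sensible rate gets close to the minimum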
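
Numerical check for B3b (illustrative): a two-way softmax with z2 held at 0 gives the same y1 as the logistic function of z1, matching the derivation above.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

for z1 in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    y1 = softmax(np.array([z1, 0.0]))[0]
    print(z1, y1, logistic(z1))  # the last two columns agree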
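
Code sketch for B4 (a sketch under assumptions, since the network from the question isn't reproduced here): a single logistic hidden unit with total input 0, a linear output unit connected to it with an assumed weight of 3 (inferred from the hidden state 1/2 producing the output 1.5), target 0.5, and squared error E = (1/2)(y - t)^2. Reading B4b as asking for the error derivative at the hidden unit's input, the chain rule gives (y - t) * w_out * h * (1 - h) = 1 * 3 * 1/4 = 3/4.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z_hidden = 0.0
h = logistic(z_hidden)              # 0.5
w_out = 3.0                         # assumed hidden-to-output weight
y = w_out * h                       # 1.5 (linear output unit)
t = 0.5
E = 0.5 * (y - t) ** 2              # 0.5, as in B4a

# Hedged reading of B4b: error derivative at the hidden unit's total input.
dE_dz_hidden = (y - t) * w_out * h * (1 - h)
print(E, dE_dz_hidden)              # 0.5 0.75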